Abstract

This paper describes MTP: a reliable transport protocol that uti- 
lizes the multicast strategy of applicable lower layer network architec- 
tures. In addition to transporting data reliably and efficiently, MTP 
provides the client synchronization necessary for agreement on the re- 
ceipt of data and the joining of the group of communicants. 

Keywords: reliable transport, multicast, broadcast, atomic broad- 
cast, agreement. 


1 Introduction 

A multicast transport is a virtual circuit connection among a set of commu- 
nicating peer-level processes. As such, any multicast transport protocol has 
to satisfy somewhat conflicting goals. Being a transport protocol, it should 
supply quick and reliable delivery of large amounts of client data among the 
communicants. Yet, being a multicast protocol, it should be able to supply 
the ordering and agreement on the delivery of the data that is necessary 
for writing decentralized applications. Agreement on order and delivery can 

*Keith Marzullo is supported in part by the Defense Advanced Research Projects
Agency (DoD) under NASA Ames grant number NAG 2-593, Contract N00140-87-C- 
8904. The views, opinions, and findings contained in this report are those of the authors 
and should not be construed as an official Department of Defense position, policy, or 
decision. 
take time, thereby slowing the delivery of the data. Hence, most multicast
protocols concentrate on a smaller set of goals; for example, [CPS8]
and [CW89] concentrate on fast delivery while [KTHB90] concentrates on 
the fast ordered delivery of relatively small messages. 

MTP, the transport described in this paper, attempts to satisfy both of 
these goals. MTP is a full-duplex, flow-controlled, reliable multicast protocol
in which the data is sequenced into (perhaps long) messages. Messages are 
sent within a process group called a web, where each message has a single 
sender and is received by all members of the web. The members of the web 
agree on the order of receipt of all messages and can agree on the delivery 
of the message even in the face of partitions¹.

MTP can be thought of as two protocols: a transport layer running un- 
derneath an ordering and agreement layer. The transport layer is a negative 
acknowledgement (or NAK) based protocol exploiting the high probability 
of successful message delivery that the local area networks of today provide
[CLZ87]. Additionally, this transport utilizes the underlying data link
and physical layer’s capability to do multicast addressing. The ordering and 
agreement protocol uses a sequencer site [CM87,KTHB90] called the master 
that grants serialized tokens to producers. 

The rest of this paper proceeds as follows. In Section 2, the class of 
applications for which MTP is meant is contrasted with those applications 
other atomic broadcast protocols support. The protocol is presented in 
Section 3. Suggestions for values of the protocol’s parameters are derived in 
Section 4, and a discussion of MTP is given in Section 5. 

2 Applications 

MTP is designed to support applications that consist of a large number of 
processes, where the processes send large messages and where the application
must be fault-tolerant (we consider crash failures of processes and
communication link failures that can lead to partitioning). Examples of 
such applications include multimedia teleconferencing systems, multiscreen 
educational systems, and stock brokerage systems. In making this assump- 
tion, we intentionally exclude certain classes of applications that have been 
considered elsewhere; in particular, those structured as client-server systems 
with highly available services (e.g. [Sch86,MS88]). 

1 A partition is the separation of a network of processes into two or more disjoint sets 
that cannot communicate with each other. 
One issue that MTP must address is the efficient handling of network
partitioning. An argument can be made that transient partitioning is a
very common failure in the kind of applications we are considering [Cri90].
Timeouts are used to detect both crash failures and communication failures. 
If a machine uses a timeout period that is too short, then it will appear to 
the machine that the network has temporarily partitioned. For CSMA/CD 
type data links, there is no upper bound on message delay (communication 
and operating system software can also increase the variance of this delay), 
so such transient partitions will be unavoidable. The application designer 
must balance the cost of recovery from partitioning against the penalty of 
using excessively long timeouts. Additionally, packets can be dropped due 
to temporary congestion at both routers and workstations, again creating 
transient partitions. 

Our approach to tolerating partitions is to choose one process in the web 
to be a distinguished process called the master. Since an MTP web contains 
such a distinguished process, partitions can be treated in the same way as 
crash or timing failures. If the master process p_0 cannot communicate with
a member process p_i, then p_0 assumes that p_i has failed. If p_i has instead
partitioned away from p_0, p_i will know that p_0 considers p_i to have failed
and behave accordingly. The vulnerability of a web to the failure of the
master is a matter of concern, however. If the application is to be long- 
lived, care must be taken in choosing the machine that runs the master. In 
Section 5.2, we discuss some techniques for making a master more robust. 

Most other atomic broadcast algorithms are structured in a very decen- 
tralized manner so the failure of any (usually size-bounded) subset of the 
processes will not cause the application to fail. Being fault-tolerant in this 
manner is very important for implementing highly-available services, but 
it means that the complex issue of tolerating partitions in a decentralized 
manner must be addressed [DGMS85]².

A more detailed description of the issues and uses of reliable broadcast
protocols can be found in [JB89].

2 A notable example of an atomic broadcast protocol that does not have a decentralized
structure is described in [KTHB90], although as presented, this protocol cannot tolerate 
partitioning. 
3 Protocol

Section 3.1 describes the overall structure of an MTP web. In Section 3.2, the
ordering and agreement protocol is described assuming an abstract trans- 
port protocol. In Section 3.3, the transport protocol is described, and in 
Section 3.4 the ordering and agreement protocol is extended to support the 
establishment of a web and the joining of a member. 

3.1 Web Structure 

An MTP web consists of a master process and a set of member processes. 
Member processes may join and leave the web, but the master process can- 
not, as the web is both instantiated and terminated by the master. All 
data is reliably multicast: that is, every process agrees on the order that
a given message will be processed, and the transport guarantees that any 
given message is either accepted by all non-failed processes or not accepted 
by any non-failed processes. 

There are four transport service access points (TSAPs) associated with 
a given web: 

1. Multicast transport addresses: These are the addresses to which all
messages targeted for the entire web are transmitted. Each consists 
of a multicast network service access point (NSAP) catenated with a 
unique transport connection identifier. 

2. Master's transport address: This is the TSAP for the master process.
This address is the destination of messages for the master process, 
such as requesting a token or leaving the web. This address is also the 
source of any message sent by the master process. 

3. Join transport address: This is the NSAP for the service³ catenated
with the predefined join transport connection identifier. This address 
is the destination of all requests to join the web. 

4. Member transport addresses: These are the addresses of all the pro- 
cesses that are currently members of the web. Each consists of the 
member process NSAP catenated with a unique transport connection 
identifier. The source of any packet transmitted by a process, regard- 
less of the packet’s destination, is a member of this set. 

3 Determining this multicast NSAP for a given instantiation is not a function performed
by MTP. 




MTP: An Atomic Multicast Transport Protocol 




3.2 Sequencing Messages 

The agreement and ordering layer of MTP ensures that all processes agree on 
which messages are accepted and in what order they are accepted. Let p_i be
a member process and M_i be the sequence of messages that p_i has delivered
to its client. The agreement layer ensures the following two properties:

AB-1 The sequences of messages that processes have delivered to their clients
do not diverge; that is, for all processes p_i and p_j, M_i is a prefix of M_j
or M_j is a prefix of M_i.

AB-2 There exists a connected subset of the nonfaulty (i.e., noncrashed)
processes that make progress.
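
Property AB-1 lends itself to a mechanical check over recorded delivery logs. The following Python sketch is illustrative only: the function names `is_prefix` and `satisfies_ab1` are hypothetical and are not part of MTP, and each delivery sequence M_i is assumed to be available as a list of message numbers.

```python
from itertools import combinations
from typing import Sequence


def is_prefix(a: Sequence, b: Sequence) -> bool:
    """Return True if sequence a is a prefix of sequence b."""
    return len(a) <= len(b) and list(b[:len(a)]) == list(a)


def satisfies_ab1(delivered: Sequence[Sequence]) -> bool:
    """Check AB-1: for every pair of processes, one delivery sequence
    is a prefix of the other, so the sequences never diverge."""
    return all(
        is_prefix(m_i, m_j) or is_prefix(m_j, m_i)
        for m_i, m_j in combinations(delivered, 2)
    )
```

A process that lags behind (a shorter sequence) does not violate AB-1; only two processes that deliver different messages at the same position do.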

Figures 1, 2 and 3 show pseudocode for the MTP agreement and ordering
protocol. In these figures, the primitive send p x sends the message
x to process p without blocking, the primitive receive p x is a CSP-like 
guard [Hoa78] that receives a matching message from process p and stores 
it into x, and multicast P x multicasts the message x to the processes in 
the set P without blocking. The predicate failed(s) represents a timeout; it 
will become true at some point after the processor that was issued the token 
for message number s has crashed or remained partitioned away from the 
master. 

To send a message m to a web, a member process first requests one of a 
set of t tokens from the master of the web. This token contains: 

• the message number to be assigned to m, 

• the multicast transport addresses as discussed in Section 3.1, 

• the status of the last t messages. Such messages can be accepted , 
rejected , or pending. Furthermore, the earliest of these t messages 
must either be accepted or be rejected. 

The master sets the status of the last t messages using the following rule.
Let m be one of these last t messages:

• if the master has seen the message m, then the status is accepted⁴;

4 The master has seen message m when it has received a data packet of message m 
containing an end of message indicator; see Section 3.3. 
process Master
begin
    members: integer set;
    status(1..t): Status := undefined .. undefined;
    t: constant integer;    the number of tokens
    next: integer := 1;
    s: integer;
    m: Data;
    last(1..t): Status;

    do receive Sender(i: 1..n) ["token.request"] and status(t - 1) ≠ pending →
        begin
            status(1..t) := pending, status(1..t - 1);
            next := next + 1;
            send Sender(i) ["token.grant", next, status, members]
        end
    [] receive Receiver(i: 1..n) ["data", s, m, last] →
        if next - s < t and status(next - s) = pending
        then status(next - s) := accepted;
    [] failure(s) and next - s < t
        and status(next - s) = pending → status(next - s) := rejected
    od
end


Figure 1: Agreement Protocol for Web Master 

• if the master has not seen the message m but the sender of m is
still operational and connected to the master (as determined by the
master), then the status is pending;

• otherwise, the status is rejected.
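
The token-granting rule and the three-way status rule above can be sketched in Python. This is a simplified illustration, not a transcription of Figure 1: the class and method names (`Master`, `Token`, `grant`, `on_end_of_message`, `on_sender_failed`) are hypothetical, message numbers start at 1 by assumption, and the master keeps the full status history in a dictionary rather than a sliding array of length t.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class Status(Enum):
    ACCEPTED = "accepted"
    REJECTED = "rejected"
    PENDING = "pending"


@dataclass
class Token:
    number: int               # message number assigned to the new message
    last: Dict[int, Status]   # statuses of (up to) the previous t messages


class Master:
    def __init__(self, t: int) -> None:
        self.t = t
        self.next_number = 1
        self.status: Dict[int, Status] = {}

    def grant(self) -> Optional[Token]:
        """Grant a token, or return None while the earliest of the
        last t messages is still pending."""
        number = self.next_number
        earliest = number - self.t
        if earliest >= 1 and self.status[earliest] is Status.PENDING:
            return None
        self.next_number += 1
        last = {n: self.status[n] for n in range(max(1, earliest), number)}
        self.status[number] = Status.PENDING
        return Token(number, last)

    def on_end_of_message(self, number: int) -> None:
        """The master has seen the end of message `number`: accept it."""
        if self.status.get(number) is Status.PENDING:
            self.status[number] = Status.ACCEPTED

    def on_sender_failed(self, number: int) -> None:
        """Timeout: the token holder crashed or partitioned away: reject."""
        if self.status.get(number) is Status.PENDING:
            self.status[number] = Status.REJECTED
```

Because a token is refused while the earliest of the last t messages is pending, at most t tokens are outstanding at any time, and a status never changes once it has left the pending state.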

An abbreviated proof of this protocol is presented in the Appendix. In- 
formally, the specification is met because the behavior of the web is defined 
by the behavior of the master. In particular, a member process accepts a 
message m only if the master accepts m, and all messages are accepted in 
the order of their message sequence numbers; thus, AB-1 is met. We define
the connected subset of correct processes referred to in AB-2 as those
processes S that remain connected to the master. The master will accept
messages sent by processes in S and possibly reject other messages, and the





process Sender(i: 1..n)
begin
    last(1..t): Status;
    members: integer set;
    s: integer;
    m: Data;

    do receive producer(i) [m] →
        begin
            send Master ["token.request"];
            receive Master ["token.grant", s, last, members];
            multicast {Master} ∪ Receiver(members) ["data", s, last, m]
        end
    od
end

Figure 2: Agreement Protocol for Web Producer 

members of S will in turn accept and reject these same messages as other
messages are sent⁵.

Having obtained a token, the member process multicasts message m with the token
for message m included in the header of the data packets that carry m. Processes
learn the status of earlier messages by seeing such packets, and can accept 
and reject messages accordingly. This protocol can tolerate up to a sequence 
of t failures; if there are t + 1 failures, then the master could send tokens 
to these processes which could then fail before any nonfaulty process sees 
any data sent with these t + 1 tokens. The headers of these tokens carry 
information about the status of earlier messages, and since no other process 
received any data sent with the earliest token, the status of some message 
will never be propagated to the members of the web. 
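
How a consumer turns the statuses carried in packet headers into an ordered delivery decision can be sketched as follows. This Python fragment is a simplified reading of Figure 3, under several assumptions: the names `Receiver`, `NeedRejoin`, `on_data` and `deliver` are hypothetical, statuses are plain strings, message numbers start at 1, and each message arrives as a single unit rather than as a sequence of packets.

```python
from typing import Dict, List


class NeedRejoin(Exception):
    """Raised when the receiver is inconsistent and must rejoin the web."""


class Receiver:
    def __init__(self, t: int) -> None:
        self.t = t
        self.data: Dict[int, bytes] = {}
        self.status: Dict[int, str] = {}   # missing entries mean "pending"
        self.next_in = 1                   # highest message number seen
        self.next_out = 1                  # next message number to resolve

    def on_data(self, number: int, last: Dict[int, str], payload: bytes) -> None:
        """Store a message and learn earlier statuses from its token header."""
        self.data[number] = payload
        for n, st in last.items():
            # A resolved status is final; never overwrite it with "pending".
            if self.status.get(n, "pending") == "pending":
                self.status[n] = st
        self.next_in = max(self.next_in, number)

    def deliver(self) -> List[bytes]:
        """Return the payloads deliverable now, in message-number order."""
        out: List[bytes] = []
        while True:
            st = self.status.get(self.next_out, "pending")
            if st == "accepted":
                if self.next_out not in self.data:
                    raise NeedRejoin("accepted message was never received")
                out.append(self.data.pop(self.next_out))
            elif st == "rejected":
                self.data.pop(self.next_out, None)
            else:
                if self.next_in - self.next_out > self.t:
                    raise NeedRejoin("status fell outside the last t messages")
                return out
            self.next_out += 1
```

The second rejoin condition mirrors the failure scenario described above: once more than t later messages have been seen while an earlier one is still pending, no future header can carry the missing status.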

3.3 Sending Messages 

The transport multicast layer of MTP is implemented using the multicast 
capability provided by the network layer (which in turn is provided by the 
data link and physical layers). For the purposes of this paper, we assume 

5 As written, a message can be acknowledged (and hence delivered) only when another
message is sent. However, the master can send empty packets, defined in Section 3.3, in
order to expedite the delivery of a message when subsequent messages are slow in being
generated.
process Receiver(i: 1..n)
begin
    data(1..): Data := empty ..;
    status(1..): Status := pending ..;
    nextIn, nextOut: integer := 1, 1;
    last(1..t): Status;
    s: integer;
    m: Data;

    do receive Receiver(j: 1..n) ["data", s, last, m] →
        begin
            k: integer := 2;
            data(s - 1) := m;
            do k < t → status(s - k) := last(k); k := k + 1 od;
            nextIn := max(s, nextIn)
        end
    [] receive consumer(i) and status(nextOut) = accepted →
        if data(nextOut) ≠ empty →
            send consumer(i) [data(nextOut)]; nextOut := nextOut + 1
        [] data(nextOut) = empty → rejoin
        fi
    [] status(nextOut) = rejected → nextOut := nextOut + 1
    [] status(nextOut) = pending and (nextIn - nextOut) > t → rejoin
    od
end


Figure 3: Agreement Protocol for Web Consumer 

that a multicast to all of the processes in a web can be accomplished by 
performing multicasts to a small number of transport service access points 
(TSAPs) — no more than can be included in the data portion of an MTP 
packet. Network facilities similar to those described in [DC90] support this 
facility, but are not necessary for MTP to operate. 

The transport layer treats a message as an uninterpreted sequence of 
bytes terminated by an end of message marker. The transport layer frag- 
ments a message into a sequence of packets. Each packet carries a sequence 
number of the form (m,p) where m is the message number and p is the packet 
number in this message, starting at zero. For example, if message 5 were 
broken into 3 packets, then the packets would be sequenced as (5,0), (5,1) 
and (5,2) (of which the last would carry an end of message marker), and
the next packet would be sequenced as (6,0).
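
The (m,p) numbering can be illustrated with a short Python sketch. The names `Packet` and `fragment` and the `capacity` parameter are hypothetical; in particular, the payload capacity of an MTP packet is not specified in this section, so the value used in the usage note is arbitrary.

```python
from typing import List, NamedTuple


class Packet(NamedTuple):
    message: int           # message number m
    packet: int            # packet number p within the message, from zero
    payload: bytes
    end_of_message: bool   # set only on the last packet of the message


def fragment(message_number: int, data: bytes, capacity: int) -> List[Packet]:
    """Split a message into packets sequenced as (m, 0), (m, 1), ..."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    # An empty message still needs one packet to carry the end marker.
    chunks = [data[i:i + capacity] for i in range(0, len(data), capacity)] or [b""]
    final = len(chunks) - 1
    return [
        Packet(message_number, p, chunk, p == final)
        for p, chunk in enumerate(chunks)
    ]
```

For instance, `fragment(5, b"x" * 3000, 1000)` yields three packets sequenced (5,0), (5,1) and (5,2), with the end of message marker on the last one only.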

There are three parameters that control the flow of data in the transport
layer. They are:

• heartbeat: A base unit of time, in milliseconds. 

• window: The maximum number of data packets a producer can send 
during any heartbeat. 

• retention: The maximum number of heartbeats a producer must 
buffer packets for possible retransmissions. 

Data is transmitted in a burst of packets such that no more than the 
current window of data packets will be sent during a single heartbeat. Every 
packet transmitted (including control packets) always contains the latest 
heartbeat, window and retention information along with the statuses of the 
previous t messages and the next message sequence number. If full packets 
are not available⁶, empty packets will be transmitted instead (defined below).
The only data packets that will be transmitted containing less than the 
maximum capacity will be those that mark a client state transition. 

An empty packet is a control packet that is multicast into the web at regular
intervals whenever the producer owning a token cannot transmit client
data. Empty packets are sent to maintain synchronization and to advertise 
the maximum sequence number of the producer. Empty packets provide the 
opportunity for consuming processes to detect and request retransmission 
of missed data as well as identifying the owner of a transmit token. 

If a producer receives a NAK from a consumer requesting the retransmis- 
sion of one or more packets, those packets will be multicast to the entire web 
or to a selected subset of the multicast TSAPs. All retransmitted packets 
will contain the original client information and sequence number. However, 
the retransmitted packets will contain updated parameter information (the 
heartbeat, window and retention). As no more than a full window of
data messages can be sent during one heartbeat, retransmitted packets have 
priority over new packets during the next heartbeat. 
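
The priority of retransmissions over new data within a single heartbeat can be sketched as below. This is a minimal illustration, assuming the producer keeps two FIFO queues; the function name `packets_for_heartbeat` and the queue arguments are hypothetical.

```python
from collections import deque
from typing import Deque, List


def packets_for_heartbeat(retransmit: Deque, fresh: Deque, window: int) -> List:
    """Pick at most `window` data packets for the next heartbeat,
    serving NAKed retransmissions before any new packets."""
    burst: List = []
    while len(burst) < window and retransmit:
        burst.append(retransmit.popleft())
    while len(burst) < window and fresh:
        burst.append(fresh.popleft())
    return burst
```

Packets that do not fit in the current window simply stay queued for a later heartbeat.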

The producer is obligated to retransmit a packet upon request for at least 
retention heartbeats after its original transmission (even after the message 
has been completely sent). If the producer receives a NAK from a consumer 

6 The resource being flow controlled is a packet carrying client data. Consequently, full
packets provide the greatest efficiency. 
process requesting the retransmission of a packet that is no longer available,
the producer sends a nak deny to the source of the request. If the consumer 
cannot recover from the loss of this packet, then the consumer rejoins the 
web to resynchronize. 
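
The retention obligation and the nak deny response can be sketched as a small buffer keyed by sequence number. The Python below is an assumption-laden illustration: the class name `RetentionBuffer`, its methods, and the string results "retransmit" and "nak_deny" are hypothetical, and time is counted in whole heartbeats.

```python
from typing import Dict, Hashable, Optional, Tuple


class RetentionBuffer:
    def __init__(self, retention: int) -> None:
        self.retention = retention
        # sequence number -> (heartbeat of original transmission, packet)
        self._sent: Dict[Hashable, Tuple[int, bytes]] = {}

    def record(self, seq: Hashable, packet: bytes, heartbeat: int) -> None:
        """Remember a packet at the heartbeat of its original transmission."""
        self._sent[seq] = (heartbeat, packet)

    def expire(self, now: int) -> None:
        """Release packets older than `retention` heartbeats."""
        stale = [s for s, (sent_at, _) in self._sent.items()
                 if now - sent_at > self.retention]
        for s in stale:
            del self._sent[s]

    def on_nak(self, seq: Hashable, now: int) -> Tuple[str, Optional[bytes]]:
        """Answer a NAK: retransmit if still buffered, otherwise deny."""
        self.expire(now)
        entry = self._sent.get(seq)
        if entry is None:
            return ("nak_deny", None)
        return ("retransmit", entry[1])
```

A consumer that receives the deny result and cannot recover would then rejoin the web, as described above.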

Figure 4 shows a space-time diagram of a process transmitting into a 
web assuming no NAKs, and Figure 5 illustrates data transmission and 
NAK processing. 

3.4 Consistency and Joining the Web 

A process p_i may become unrecoverably inconsistent with the master of the
web for several reasons. The most likely reason is that p_i has partitioned
away long enough from the master so that p_i missed learning the status of a
message. A less likely scenario is that some process p_j transmits a message
that is received by the master but not by p_i, and p_j crashes before p_i can ask
for retransmission of the missed packets. In any case, when a process finds 
itself inconsistent with the master, it can resynchronize itself by rejoining 
the web. 

As described in Section 3.1, the master of a web constructs the master 
transport address by catenating the NSAP with a locally generated unique 
transport connection identifier. A process that wishes to join or rejoin the 
web will send a join request message to the join transport address, and the 
master will answer with a join confirm message whose source is the master
transport address. Note that a rejoining process can determine whether the
web is the same session with which it became inconsistent by comparing 
the previous and new transport connection identifiers it obtained in the join 
confirm messages. 

In general, a process that repeatedly receives no join confirm cannot 
elect itself the master. Another process may follow the same reasoning in 
another partition, and then if the partition were to end, there would be two 
inconsistent webs with undesirable properties; for example, a third joining 
process would nondeterministically join one of the two existing webs. Any 
“merging” of such inconsistent webs would have to be done outside of MTP, 
as the semantics of such a merge would depend on the application. A better 
method for master selection would be for a process to know a priori if it 
were the master or not. Doing so would both guarantee that there exists 
only one active web with a given NSAP and would allow the master to be 
located on a machine that is known to be reliably available. 

Having joined a web, a process p must be informed which message it 
should first accept. If p does not need to be given any state in order to
process the next message, then the master can immediately reply to the join 
request message with a join confirm message containing the sequence number 
s of the next token the master will hand out. Then, the joining process p 
need only start receiving messages with sequence numbers greater than or 
equal to s. However, for some applications p would need to be initialized 
with the state of the web after all messages before s have been accepted or
rejected. In this case, having received a join request, the master will stop 
granting token requests and will delay sending a join confirm message to p 
until all messages before s have been accepted or rejected. Then, the master
can respond with the join confirm, p's state can be initialized (either by 
having the master send the state or through a protocol outside of MTP), 
and the master can resume granting tokens. 
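
The condition under which the master may send the join confirm can be stated compactly. The Python sketch below is illustrative only: the function name `join_confirm_ready` and its arguments are hypothetical, and the master is assumed to track the set of message numbers whose status is still pending.

```python
from typing import Set


def join_confirm_ready(pending: Set[int], s: int, needs_state: bool) -> bool:
    """Decide whether the master may answer a join request now.

    pending      message numbers whose status is still pending
    s            sequence number of the next token the master will hand out
    needs_state  True if the joiner must be initialized with web state
    """
    if not needs_state:
        # The joiner just starts receiving at message s.
        return True
    # With state transfer, every message before s must be resolved first.
    return all(n >= s for n in pending)
```

While this returns False for a joiner that needs state, the master would hold back both the join confirm and any new token grants.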

Figure 6 shows a space-time diagram illustrating the sequence of messages
during a join with a transfer of state from the master.


4 Parameter Values 

The values of heartbeat, window and retention can be adjusted by the trans- 
port to reflect the capability of the members, the type of application being 
supported and the network topology. In general, the producers will try to 
drive these numbers towards a higher performance level, and the consumers 
will try to drive these numbers towards a higher reliability level. By doing 
so, both are trying to optimize the quality of service. 

Producers can try to improve the performance by reducing the heartbeat 
interval and by increasing the window size. This will have the effect of 
increasing the resources committed to the transport at any time. To level 
the resource commitment, the producer may also reduce the retention. In 
the worst case, a producer must commit enough storage to hold window size
× retention maximum-size packets for heartbeat × retention milliseconds.

Consumers must rely on their clients to consume the data occupying the 
resources of the transport. The consumer transport implementation must 
monitor the level of committed resources in order to ensure that resources 
are not overcommitted. Since MTP is a NAK-based protocol, the consumer 
is required to inform a producer if a change in parameters is required. A 
consumer must be capable of committing at least t times the memory com- 
mitted by a producer. 

For more reliable operation, a consumer would try to extend the heartbeat
interval and increase retention. This has the effect of increasing the
resources needed to support the transport. To counteract this, the consumer 
could reduce the window. 

In order to make these parameters more concrete, consider MTP running 
on a collection of 1-MIP workstations with local industry-standard disks, 
communicating over an IEEE 802.3 local area network. The heartbeat is
approximately the transport time constant. Assuming that the transport 
can be modeled as a closed loop function, reaction to feedback into the 
transport should settle out in three time constants. In a transport that is 
constrained to a single network, the dominant cause of processing delay will 
most likely be the page fault resolution time. The time to service a page 
fault is overwhelmingly the disk access time, and for the current industry- 
standard disks, around 40 milliseconds is the average worst-case access time. 
In the worst case, this time could double in order to reclaim a dirty page. 
Allowing for additional overhead and scheduling delays, two times the worst 
case page fault resolution time should be a suitable minimum transport time 
constant, which is 160 milliseconds. 

The window is the number of packets that can be consumed during 
one heartbeat. For IEEE 802.3 local area networks, the transmit time per 
packet is 1.2 milliseconds for a full packet of 1500 bytes. The processing 
time on a 1-MIP machine running Unix should be around 5 milliseconds for 
a full packet (where 2.5 - 3 ms of this is incurred by the operating system). 
Assuming that the data for the packet originated from a disk backing store 
and that disk service overhead is comparable to network service overhead, 
the resulting overhead is 11.2 milliseconds per packet, corresponding to a
bandwidth of about 1 Mbit/sec. During a heartbeat of 160 milliseconds, 14 packets
can be sent, so the maximum window would be approximately 14 packets 
per heartbeat. 
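
The arithmetic behind the heartbeat and window figures can be reproduced with a few lines of Python. The function names are hypothetical, and the 5 millisecond disk service term is an interpretation of the statement that disk service overhead is comparable to network service overhead; it is the value that makes the per-packet total come to 11.2 milliseconds.

```python
def heartbeat_ms(disk_access_ms: float) -> float:
    """Twice the worst-case page fault time, which is itself twice
    the disk access time (to reclaim a dirty page)."""
    worst_case_page_fault_ms = 2 * disk_access_ms
    return 2 * worst_case_page_fault_ms


def per_packet_ms(transmit_ms: float, processing_ms: float,
                  disk_service_ms: float) -> float:
    """Total overhead to move one full packet."""
    return transmit_ms + processing_ms + disk_service_ms


def bandwidth_mbit_per_s(packet_bytes: int, packet_ms: float) -> float:
    """Throughput if one packet is handled every packet_ms milliseconds."""
    return packet_bytes * 8 / packet_ms / 1000.0


def max_window(heartbeat: float, packet_ms: float) -> int:
    """Whole packets that fit into one heartbeat."""
    return int(heartbeat // packet_ms)
```

With the values from the text, `heartbeat_ms(40)` gives 160 milliseconds, `per_packet_ms(1.2, 5, 5)` gives 11.2 milliseconds, the bandwidth comes to roughly 1.07 Mbit/sec, and `max_window(160, 11.2)` gives 14 packets.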

At worst, each producer could consume 10 percent of the available net- 
work bandwidth, so MTP will not be limited by the network bandwidth. 
Each producer consumes about 80 percent of the consumer’s processing 
time, so having more than one producer outstanding could saturate a con- 
sumer. However, to a point, having multiple tokens allows some producers 
to acquire a token shortly before it is required (presumably overlapping the 
transmission of an earlier message) without locking out another producer. 
Additionally, increasing t decreases the average message delivery time (until 
thrashing becomes a problem). Since the peak resource requirement scales 
linearly with t, a reasonable value of t would probably be two or three. 

Reducing retention may introduce instability because a consumer will 
have less opportunity to react to missing data. Data can be missed for
a variety of reasons. If constrained to the local net, the data lost due to
corruption should be around one packet in 50,000⁷. Four orders of magnitude
more packets are lost at receiving stations, including packet switch routers, 
than over physical links. The losses are usually the result of congestion and 
resource starvation at lower layers due to the processing of (nearly) back to 
back packets. One can only require that a receiving station be capable of 
receiving some number of back to back packets successfully, and that number 
must be at least greater than the window size. The probability of success 
can be made as high as needed by providing the receiver the opportunity to 
observe the data multiple times. 

At worst, the receiving station detects packet loss using timers. Such 
timers might have a granularity of more than two orders of magnitude 
greater than the maximum packet transmit time. As such, the worst case 
is much worse than detecting data loss due to gaps in sequence numbers. 
When the loss is detected, the response (a NAK) is transmitted and should 
be received at the producing process in less than two heartbeats after the 
data it references was transmitted. Again, it is the detection time that dom- 
inates, not the transmission of the NAK. NAKs are also subject to loss, but 
the probability of delivery can be made close to one by retransmitting. In 
order to be able to respond to a second NAK, the minimum retention is 
three. 

The resources committed to a transport using the above assumptions are 
buffers sufficient for 126 packets of 1500 bytes each, and each buffer will be 
committed for at least 480 milliseconds. 
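
These totals follow from the formulas given earlier in this section: a producer holds window × retention packets for heartbeat × retention milliseconds, and a consumer must be able to commit t times that. The Python sketch below reproduces the numbers; the function names are hypothetical, and the choice t = 3 is an inference, being the value from the suggested range of two or three that yields 126 packets.

```python
from typing import Tuple


def producer_commitment(window: int, retention: int,
                        heartbeat_ms: int) -> Tuple[int, int]:
    """Return (packets buffered, milliseconds each buffer is held)."""
    return window * retention, heartbeat_ms * retention


def consumer_packets(window: int, retention: int, t: int) -> int:
    """A consumer commits t times the buffers of a single producer."""
    return t * window * retention


packets, hold_ms = producer_commitment(window=14, retention=3, heartbeat_ms=160)
total = consumer_packets(window=14, retention=3, t=3)
total_bytes = total * 1500
```

Here `packets` is 42, `hold_ms` is 480, `total` is 126, and `total_bytes` is 189,000, or a little under 190 kilobytes of buffer space.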

The parameters would be very different for a web that spans an internet- 
work of several LANs, and could be adjusted to accommodate the properties 
of the network. For example, if a producer is separated from a set of con- 
sumers by a router and the router drops a packet due to congestion then all 
of the consumers will simultaneously send NAKs, further aggravating the 
congestion. To avoid this burst of NAKs, the master could have previously 
set the web’s retention to l + 3 for some positive value of l. Each NAKing
consumer would then dally for some number of heartbeats between 0 and
l before NAKing a missed packet. Not only would this dallying reduce the
number of simultaneous NAKs by a factor of l, but most processes would
probably receive the retransmission without sending a NAK. 
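
The dallying scheme can be sketched as a per-consumer timer table. This Python fragment is only one possible reading: the class name `DallyingNak` and its methods are hypothetical, the delay is assumed to be drawn uniformly from 0 to l heartbeats, and re-NAKing after a lost NAK is left out for brevity.

```python
import random
from typing import Dict, Hashable, List, Optional


class DallyingNak:
    def __init__(self, l: int, rng: Optional[random.Random] = None) -> None:
        self.l = l
        self.rng = rng or random.Random()
        self.due: Dict[Hashable, int] = {}   # missing seq -> heartbeat to NAK at

    def on_loss_detected(self, seq: Hashable, now: int) -> None:
        """Schedule a NAK between 0 and l heartbeats from now."""
        self.due.setdefault(seq, now + self.rng.randint(0, self.l))

    def on_packet(self, seq: Hashable) -> None:
        """A retransmission arrived (perhaps due to another consumer's
        NAK), so the pending NAK is no longer needed."""
        self.due.pop(seq, None)

    def naks_to_send(self, now: int) -> List[Hashable]:
        """Return and clear the NAKs whose dally period has elapsed."""
        ready = sorted(s for s, h in self.due.items() if h <= now)
        for s in ready:
            del self.due[s]
        return ready
```

Setting retention to l + 3 keeps the producer's buffer alive for the extra l heartbeats that the slowest consumer might wait before sending its NAK.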

7 Telephone links (between routers, for example) are capable of exhibiting similar cor-
ruption rates. 
5 Discussion

5.1 Number of Tokens 

In Section 4, it was argued that a reasonable number of tokens would be 
around two or three. It isn’t clear what the number of tokens should be 
when a web spans a larger collection of networks. On one hand, having more 
tokens allows more processors to pre-allocate tokens, thereby overlapping the 
longer round-trip message time with (hopefully) other processing. On the 
other hand, the maximum number of buffers increases with the number of 
tokens, and processors distant from the master are more likely to partition 
away from the master, thereby increasing the number of failures. 

One can allow the master to find a balance by varying the number of 
tokens. This is done by logically splitting t into the two values t_max, which is
the maximum number of tokens that can be outstanding and is the number of
message statuses carried in a header, and t_cur, which is the current maximum
number of tokens that can be outstanding and need be known only by the
master. The number of failures that can be tolerated is determined by
t_max (see the discussion in the Appendix). The master could then vary t_cur
between 1 and t_max depending on the web performance.

5.2 Resiliency Against Failure of the Master 

The main vulnerability of MTP is that the failure of the master can cause 
the web to fail. For some applications (e.g., a stock brokerage system),
such a failure could be intolerable. In this case, it would become desirable 
to replicate the web master. Replicating the master for high tolerance to 
processor failure can be done without changing MTP, but having a replicated 
master would be noticed by the members as an increase in the response time 
to a token request (and less importantly, to a join request). 

All the master replicas p_0^1, p_0^2, ..., p_0^r would reside on an unpartitionable
network (for example, a single local area network), guaranteeing that if a
member p_i is connected with p_0^a and member p_j is connected with p_0^b, then
p_i is connected with p_0^b and p_j is connected with p_0^a. The web’s master
TSAP would be a multicast address for these replicated masters.

The masters would choose one amongst themselves to be the coordinator,
with the rest being cohorts [BJ87]. Any replica receiving a request would
atomically broadcast the request to all the master replicas before the
coordinator would respond. Similarly, when the coordinator decides that a
message becomes accepted, the coordinator would first atomically broadcast
this fact to all the master replicas⁸. If the coordinator were then to
fail, one cohort would become the new coordinator. This new coordinator 
would reject all messages that it considered pending and start responding to 
master requests. 

5.3 Web Membership 

One issue we have not discussed in this paper is how a process can determine
the current membership of a web. Knowing this information can be very
useful; for example, if all the processes agree on the current web membership,
then each can agree a priori on how work should be partitioned amongst
themselves. The group membership problem is essentially that of having the
web members agree on when a process joins the web and when a process
leaves the web (either by failing, by partitioning away, or of its own
volition) [Cri88, Ric90]. The difficulty with the group membership problem
is that it really cannot be “solved”: since a process can fail without notifying
any other process, a member of a web cannot be sure whether or not another
process is currently a member. The best that can be done is to have the
web members agree on the membership of the web, and accept the fact that
there may be members that have crashed, and that there may be processes
that, due to the asynchronism in the system, have been excluded from the
web even though they have not crashed or partitioned away (see footnote 9).

Group membership protocols operate by having processes monitor each
other. If a process p' decides that another process p has failed, then p' uses
some reliable broadcast protocol to disseminate this information to the other
web members [Ric90]. A common method of detecting whether a process
p has failed or not is to use low-level “alive” messages: other processes
periodically expect such messages from p (perhaps as the result of periodic
requests), and assume that p has failed if such messages cease to arrive.
Once all web members agree that p has failed (even if it has not), the new
web membership is defined.

8 As stated in this paper, the only time a member learns the status of a message is
when it receives a token or a data message from another member. So, if the coordinator
were to notify the cohorts before granting a token, then the cohorts would be consistent.
However, in the actual protocol the master may send periodic empty packets to expedite
the delivery of messages. If this empty packet advertises a new status, then the coordinator
must inform the cohorts.

9 Web members must be careful in the deductions they make from the purported group
membership. For example, even if a process p was a member of a web through the
delivery of some message m, other web members cannot assume that p actually processed
any message ordered before m unless p specifically acknowledged this fact. To do
otherwise would be assuming a solution exists to the coordinated attack problem, which is
unsolvable [Gra79].

Since MTP is a NAK-based protocol, there is no defined low-level “alive”
protocol. A web membership protocol, however, can be implemented on top
of MTP as part of the application protocol. Each web member maintains
a set that contains the current web membership. When a process p joins a
web, p multicasts this fact to the web, and all web members (including p)
add p to their membership set when they receive this message. Similarly, if
a process p' decides, for any reason, that another process p has failed, then
p' multicasts this fact to the web. If p' is still a member of the web when
this message is delivered, then each process (including p') removes p from
its membership set when it receives this message.
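The membership rule just described can be sketched as follows. This is a minimal sketch and not part of MTP: the class `MembershipView`, the message tuple `(kind, sender, subject)` and the helper `replay` are all hypothetical names. It assumes that MTP delivers the same messages in the same order to every member, which is what lets each member update its set independently and still agree.

```python
class MembershipView:
    """One member's local copy of the web membership set (hypothetical)."""

    def __init__(self):
        self.members = set()

    def deliver(self, kind, sender, subject):
        if kind == "join":
            # Every member, including the joiner, adds the new process.
            self.members.add(subject)
        elif kind == "failed" and sender in self.members:
            # A failure report counts only if its sender is still a member
            # at the moment the report is delivered.
            self.members.discard(subject)


def replay(messages, n_views=3):
    """Deliver the same ordered message sequence to several views."""
    views = [MembershipView() for _ in range(n_views)]
    for msg in messages:
        for view in views:
            view.deliver(*msg)
    return [sorted(view.members) for view in views]
```

If q reports r as failed and r concurrently reports q as failed, the agreed delivery order decides the outcome: whichever report is delivered first removes its subject, and the second report is ignored because its sender is no longer a member.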

Such membership information is of interest to the master. As discussed 
in Section 3.1, the master includes a list of multicast TSAPs in a token 
grant message. This list of TSAPs covers the membership of the web as 
known by the master, which as currently presented may not be the same 
as the membership set described above. The solution in MTP is to allow 
a producer and receiver to execute with the master. These processes can
exchange the membership changes each observes: the master sees token losses
and the receiver sees member-observed failures. By doing so, the master
can remove a multicast TSAP from its list when all processes reached via 
that TSAP have left the web, and the producer can multicast the removal 
of a member process when that member loses a token. 
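The master's bookkeeping for multicast TSAPs might look like the sketch below. The mapping from TSAP to the set of members reached through it, the function name `remove_member` and the TSAP labels used in the example are assumptions made for illustration; the paper does not specify how the master stores this list.

```python
def remove_member(tsap_members, member):
    """Drop `member` from every TSAP entry and delete TSAPs left empty.

    `tsap_members` maps a multicast TSAP (any hashable label) to the set of
    member processes reached through it. The dict is updated in place, and
    the TSAPs still in use are returned in sorted order.
    """
    for tsap in list(tsap_members):
        tsap_members[tsap].discard(member)
        if not tsap_members[tsap]:
            # No process is reached via this TSAP any more.
            del tsap_members[tsap]
    return sorted(tsap_members)
```

With `{"net-a": {"p1", "p2"}, "net-b": {"p3"}}`, removing `p3` leaves only `net-a` in the list that the master would include in later token grants.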

5.4 Conclusions 

MTP is a multicast transport that supports the strong conditions of agree- 
ment on delivery, agreement on order and agreement on web membership. 
An implementation of MTP is currently under way. 
Appendix: Specification and proof

This appendix presents a specification and a proof of the ordering and
agreement protocol. In the interest of brevity, the proof is somewhat informal
and incomplete; in particular, several simple lemmas are stated and used
without proof.

Let $p_0$ be the master process and $p_1$ through $p_n$ be the member processes.
The sequence of messages that $p_i$ has delivered to its client is denoted as
$M_i$, and we write $M_i \sim M_j$ to mean that $M_i$ is a prefix of $M_j$ or $M_j$
is a prefix of $M_i$. Similarly, we will denote by $A_i$ the messages that $p_i$
has marked as accepted and by $R_i$ the messages that $p_i$ has marked as
rejected. Both $R_0$ and $A_0$ are defined, but as there is no client of the
master, $M_0$ is not defined. The sequence number of a message sent with the
statement multicast ... [“data”, s, last, m] is $s - 1$, which we will denote
as $m.seq$ (see footnote 10). We will write $m_1 < m_2$ as shorthand for
$m_1.seq < m_2.seq \wedge m_1 \in A_0 \wedge m_2 \in A_0$.

The subset of processes that are not faulty is denoted as $C$. The state
predicate $conn(p_i, p_j)$ is true when $p_i$ and $p_j$ are connected, the state
predicate $send(m, p_i)$ is true when $p_i$ sends message $m$, the state
predicates $produce(i)$ and $consume(i)$ are true when the client on $p_i$
requests a message to be sent and requests data respectively, and the state
function $S$ is a subset of the processes $p_1, p_2, \ldots, p_n$.

The specification consists of two properties. The first is a safety property, 
which specifies that “bad” states do not occur, while the second is a liveness
property, which specifies that “good” states will eventually occur. 

AB-1 The sequences of messages delivered to the clients do not diverge:

$$\Box\,(\forall p_i, p_j : M_i \sim M_j)$$

AB-2 There exists a connected subset $S$ of the correct processes $C$ that
make progress:

$$\Box\,(\forall m, p_i, p_j : p_i, p_j \in C : send(m, p_i) \wedge \Box\,(p_i, p_j \in S) \Rightarrow \Diamond\, m \in M_j)$$

10 Formally, any reference to $m$ is actually a reference to $m.seq$. The values of
$R_i$ and $A_i$ for $i > 0$ are state functions whose values are defined by the arrays
Producer.status and Producer.data: if there exists a state in which process $p_i$ has
status[k] = accepted and data[k] ≠ empty, then in that state $m : m.seq = k : m \in A_i$,
and if there exists a state in which process $p_i$ has status[k] = rejected, then in
that state $m : m.seq = k : m \in R_i$. We can then define $m \in M_i$ as
$m \in A_i \wedge nextOut > m.seq$. Similarly, the values of $R_0$ and $A_0$ are defined
by the array Master.status: if in some state status[k] = accepted, then henceforth
$m : m.seq = next - k : m \in A_0$, and if in some state status[k] = rejected, then
henceforth $m : m.seq = next - k : m \in R_0$.


Our assumptions are: 

• All $M_i$ and $A_i$ are initially empty;

• the master never fails: $p_0 \in C$;

• $conn(p_i, p_j)$ is an equivalence relation (i.e., it is symmetric and
transitive);

• unbounded fairness is followed in the selection of enabled guards, i.e.,
a guard that remains true will eventually be selected;

• clients on correct processors always continue to send messages and
consume messages:

$$\Box\, \forall p_i : p_i \in C : \Diamond\, produce(i) \wedge \Diamond\, consume(i)$$

Additionally, we will assume without proof that the protocol satisfies the 
following three lemmas: 

1. The delivery of a message is monotonic:

$$L_1: \Box\,(\forall m, p_i : (m \in M_i) \Rightarrow \Box\,(m \in M_i))$$

2. A process cannot both accept and reject the same message:

$$L_2: \Box\,(\forall m, p_i : (m \in A_i) \Rightarrow (m \notin R_i))$$

3. Clients receive messages in message sequence number order:

$$L_3: \Box\,(\forall m_1, p_i : m_1 \in M_i \Rightarrow (\forall m_2 : m_2.seq < m_1.seq : m_2 \in M_i \vee m_2 \in R_i))$$
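Lemmas $L_2$ and $L_3$ can be read as checks on a single snapshot of one process's state. The sketch below is illustrative only: the function names are hypothetical, messages are identified by their sequence numbers, and sequence numbers are assumed to start at 0, none of which is stated in the paper.

```python
def l2_holds(accepted, rejected):
    """L2: no message is both accepted and rejected by the same process."""
    return set(accepted).isdisjoint(rejected)


def l3_holds(delivered, rejected):
    """L3: every sequence number below a delivered message is itself
    delivered or rejected at this process.

    Assumes sequence numbers are the integers 0, 1, 2, ...
    """
    delivered, rejected = set(delivered), set(rejected)
    return all(
        seq in delivered or seq in rejected
        for m1 in delivered
        for seq in range(m1)
    )
```

A process that has delivered messages 0, 1 and 3 and rejected message 2 satisfies $L_3$; one that has delivered 0 and 2 while message 1 is still pending does not.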
Showing Safety One can show a program satisfies a safety property $E$ by
finding a property $I$ such that the initial conditions $Init$ imply $I$, $I$
implies $\Box I$, and $I$ implies $E$. For $I$, we will use the conjunction of
the two predicates $I_1$ and $I_2$:

$$I_1: \forall m, p_i : (m \in R_i) \Rightarrow (m \in R_0)$$

$$I_2: \forall m_1, m_2, p_i : (m_1 < m_2 \wedge m_1 \notin M_i) \Rightarrow m_2 \notin M_i$$

Initially, all $M_i$ are empty, making the antecedents of $I_1$ and $I_2$ both
false; thus, $Init \Rightarrow I$. To show that $I_1$ and $I_2$ imply AB-1, note that
together they state that for all $p_i$, $M_i$ is a prefix of $M_0$. Since $M_i$ is a
prefix of $M_0$ and $M_j$ is a prefix of $M_0$, at least one of $(M_i, M_j)$ is a
prefix of the other, meaning $M_i \sim M_j$.
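The final step of this argument, that two prefixes of a common sequence are always prefixes of one another, can be checked mechanically. The sketch below is illustrative: the helper names are hypothetical, and delivered sequences are modelled as Python lists of sequence numbers.

```python
def is_prefix(a, b):
    """True if sequence a is a prefix of sequence b."""
    return len(a) <= len(b) and b[:len(a)] == a


def compatible(a, b):
    """The relation written M_i ~ M_j: one sequence is a prefix of the other."""
    return is_prefix(a, b) or is_prefix(b, a)


def prefixes(seq):
    """All prefixes of seq, from the empty sequence up to seq itself."""
    return [seq[:k] for k in range(len(seq) + 1)]


def ab1_holds(deliveries):
    """AB-1 on one snapshot: all delivered sequences are pairwise compatible."""
    seqs = list(deliveries.values())
    return all(
        compatible(x, y)
        for i, x in enumerate(seqs)
        for y in seqs[i + 1:]
    )
```

Every pair drawn from `prefixes([1, 2, 3, 4])` is compatible, whereas `[1, 2]` and `[1, 3]` are not, which is the kind of divergence that AB-1 rules out.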

We now prove $I_1 \Rightarrow \Box I_1$. By the form of $I_1$, it can become false
only if a member $p_i$ rejects a message $m$ before $p_0$ rejects $m$. For $p_i$
to reject $m$, it must have received a data message from a member process $p_j$
containing values of $s$ and $last$ such that $last(s - m.seq) = rejected$. To
send such a data message, $p_j$ must have received from $p_0$ a token grant
message containing the same values of $s$ and $last$. By the definition of
$R_0$, $m \in R_0$. Thus, $I_1 \Rightarrow \Box I_1$.

We now prove $I_2 \Rightarrow \Box I_2$. $I_2$ can become false only if the expression
$(m_1 < m_2 \wedge m_1 \notin M_i) \wedge m_2 \in M_i$ becomes true for some messages
$m_1$ and $m_2$ and member process $p_i$; that is, $p_i$ delivers a message $m_2$ to
its client but has not (yet) delivered $m_1$ to its client, where $m_1$ and $m_2$
have both been accepted by $p_0$ and $m_1.seq < m_2.seq$. By $L_3$, we know that
$m_1 \in M_i \vee m_1 \in R_i$, and since by assumption $m_1 \notin M_i$, we know that
$m_1 \in R_i$. By $I_1$, we know that $m_1 \in R_0$, but since $m_1 < m_2$ we know that
$m_1 \in A_0$. This is a contradiction (it violates $L_2$), so $I_2 \Rightarrow \Box I_2$.

Showing Liveness To show liveness, we will first assume that for the
master $\Box\,(t > next)$, which implies $\Box\,(status(t - 1) = null)$. We will then
show the effects when $t$ is assumed to have a more reasonable value.

Property AB-2 is expressed in terms of the set of processes $S$; we will
define this set as $p_i \in S \equiv conn(p_0, p_i)$. Rewriting, we get

$$I_3: \Box\,(\forall m, p_i, p_j : p_i, p_j \in C : send(m, p_i) \wedge \Box\,(conn(p_0, p_i) \wedge conn(p_0, p_j)) \Rightarrow \Diamond\, m \in M_j)$$

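To show $I_3$, we will need the following five liveness properties, of which
$I_4$, $I_5$ and $I_8$ imply $I_3$:

$$I_4: \Box\,(\forall m, p_i : p_i \in C : send(m, p_i) \wedge \Box\,(conn(p_0, p_i)) \Rightarrow \Diamond\, m \in A_0)$$

$$I_5: \Box\,(\forall m, p_j : p_j \in C : m \in A_0 \wedge \Box\,(conn(p_0, p_j)) \Rightarrow \Diamond\, m \in A_j)$$

$$I_6: \Box\,(\forall m, p_j : p_j \in C : m \in R_0 \wedge \Box\,(conn(p_0, p_j)) \Rightarrow \Diamond\, m \in R_j)$$

$$I_7: \Box\,(\forall m : \Diamond\,(m \in A_0 \vee m \in R_0))$$

$$I_8: \Box\,(\forall m, p_j : p_j \in C : m \in A_j \wedge \Box\,(conn(p_0, p_j)) \Rightarrow \Diamond\, m \in M_j)$$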

For brevity, only an informal proof for $I_5$ will be shown. If there is
no $p_j$ that satisfies the antecedent of $I_5$, then the lemma is vacuously
true, so we will assume that there is at least one such $p_j$, say $p_k$. By
assumption, the producer on $p_k$ will eventually request a message to be sent,
and by finite progress $p_k$ will eventually send a message to $p_0$ requesting
a token. By fairness and connectivity, $p_0$ will eventually select the guard
$(status(t - 1) = null)$. By the definition of $A_0$,
$m \in A_0 \Rightarrow status(next - m.seq) = accepted$, which is passed
back to $p_k$ in the token grant message (again by fairness and connectivity).

By finite progress, $p_k$ will send a message containing the value of $last$,
and since connectivity is an equivalence, any $p_j$ is connected to $p_k$, and
will therefore receive this message. Then, by finite progress $p_j$ will
eventually set $m \in A_j$, and the lemma holds.

The effect of letting $t$ be smaller than the maximum sequence number
is that a nonfaulty process $p_j$ that is connected to $p_0$ may not satisfy AB-2;
in particular, $I_5$ and $I_6$ may not hold. A sequence of $l > t$ token requests by
processes that appear to fail after having been granted a token will generate
a sequence of $l$ rejected messages. However, when $p_j$ receives message $m$, it
only sets the status for messages $m' : m'.seq > m.seq - 1$, so there will be some
message whose status will remain pending. Eventually, $nextIn - nextOut$
will be greater than $t$ and $nextOut$ will point to the pending message, forcing
$p_j$ to rejoin. Thus, the algorithm is live only if there are no sequences of
rejected messages with a length of $l > t$.
[Figure 4: Normal Data Transmission. Packets data(n), data(n + 1) and
data(n + w - 1) are sent in one heartbeat, followed by data(n + w),
data(n + w + 1) and data(n + 2w - 1) in the next. On empty(n + 2w),
n..n + w - 1 can be released; on data(n + 2w) with eom, n + 2..n + w + 2 can
be released. Parameters: window w = 3, retention r = 2, heartbeat h.]
[Figure 5: NAKs and Retransmission. Parameters: retention r = 2, heartbeat h.]